# Initialize Otter
import otter
grader = otter.Notebook("hw5.ipynb")
Group: Helena Sokolovska and Marvel Hariadi¶
CPSC 330 - Applied Machine Learning¶
Homework 5: Putting it all together¶
Associated lectures: All material till lecture 13¶
Due date: Monday, Mar 10, 11:59 pm
Table of contents¶
- Submission instructions
- Understanding the problem
- Data splitting
- EDA
- Feature engineering
- Preprocessing and transformations
- Baseline model
- Linear models
- Different models
- Feature selection
- Hyperparameter optimization
- Interpretation and feature importances
- Results on the test set
- Summary of the results
- Your takeaway from the course
Submission instructions¶
rubric={points:4}
You may work with a partner on this homework and submit your assignment as a group. Below are some instructions on working as a group.
- The maximum group size is 2.
- Use group work as an opportunity to collaborate and learn new things from each other.
- Be respectful to each other and make sure you understand all the concepts in the assignment well.
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline.
- You can find the instructions on how to do group submission on Gradescope here.
- If you would like to use late tokens for the homework, all group members must have the necessary late tokens available. Please note that the late tokens will be counted for all members of the group.
Follow the homework submission instructions.
- Before submitting the assignment, run all cells in your notebook to make sure there are no errors by doing
Kernel -> Restart Kernel and Clear All Outputs and then Run -> Run All Cells.
- Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
- Follow the CPSC 330 homework instructions, which include information on how to do your assignment and how to submit your assignment.
- Upload your solution on Gradescope. Check out this Gradescope Student Guide if you need help with Gradescope submission.
- Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope.
Note: The assignments will get gradually more open-ended as we progress through the course. In many cases, there won't be a single correct solution. Sometimes you will have to make your own choices and your own decisions (for example, on what parameter values to use when they are not explicitly provided in the instructions). Use your own judgment in such cases and justify your choices, if necessary.
A tuned decision tree (max_depth=3) appears to be the best model for predicting whether or not clients will default (Accuracy: 0.8043); however, it is not much better than the dummy model (CV scores: 0.808 +/- 0.003 for the tree vs. 0.777 for the dummy). This is further evidenced by the poor precision, recall, f1, and AP scores (Precision: 0.5994, Recall: 0.2895, f1: 0.3904, AP: 0.45).
Imports¶
Imports
Points: 0
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.inspection import permutation_importance
from sklearn.model_selection import (
GridSearchCV,
RandomizedSearchCV,
cross_val_score,
cross_validate,
cross_val_predict,
train_test_split,
)
import shap
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
from sklearn.feature_selection import SequentialFeatureSelector
Introduction ¶
In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.
A few notes and tips when you work on this mini-project:
Tips¶
- This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary.
- Do not include everything you ever tried in your submission -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code.
- If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to success in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution.
Assessment¶
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results. For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.
A final note¶
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (15-20 hours???) is a good guideline for this project. Of course, if you're having fun, you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or for the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found these kinds of labs useful and fun, and I hope you enjoy it as well.
1. Pick your problem and explain the prediction problem ¶
rubric={points:3}
In this mini-project, you have the option to choose which dataset you will work on. The tasks you will need to carry out will be similar regardless of your choice.
Option 1¶
You can choose to work on a classification problem of predicting whether a credit card client will default or not. For this problem, you will use Default of Credit Card Clients Dataset. In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with the associated research paper, which is available through the UBC library.
Option 2¶
You can choose to work on a regression problem using a dataset of New York City Airbnb listings from 2019. As usual, you'll need to start by downloading the dataset; then you will try to predict reviews_per_month, as a proxy for the popularity of the listing. Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts in creating more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.
Note there is an updated version of this dataset with more features available here. The features we are using are in listings.csv.gz for the New York City dataset. You will also see some other files like reviews.csv.gz. For your own interest, you may want to explore the expanded dataset and try your analysis there. However, please submit your results on the dataset obtained from Kaggle.
Your tasks:
- Spend some time understanding the options and pick the one you find more interesting (it may help to spend some time looking at the documentation available on Kaggle for each dataset).
- After making your choice, focus on understanding the problem and what each feature means, again using the documentation on the dataset page on Kaggle. Write a few sentences on your initial thoughts on the problem and the dataset.
- Download the dataset and read it as a pandas dataframe.
Solution_1
Points: 3
We went with option 1: predicting whether or not credit card clients will default on their bills.
We think there are also a lot of unique traits to the data that come from the cultural context of Taiwan. For example, the possible values for marital status are married, single, or others. They do not track engagement, dating, or common-law partnerships. All of this information would be tracked in Canada, but in this dataset those statuses are grouped together under the "other" bucket. We also found it interesting that the EDUCATION field uses ordinal encoding but levels 5 and 6 are both listed as "unknown." The Kaggle docs specifically state the following:
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
We may need to do some feature manipulation to address this issue.
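One possible fix (a sketch, not our final preprocessing; `edu` below is a made-up sample of EDUCATION codes, not the real data) is to collapse the undocumented codes 0, 5, and 6 into the documented "others" level (4):

```python
import pandas as pd

# Hypothetical sample of EDUCATION codes, including the undocumented 0, 5, and 6
edu = pd.Series([1, 2, 3, 4, 5, 6, 0, 2])

# Collapse 0, 5, and 6 into the documented "others" level (4)
edu_clean = edu.replace({0: 4, 5: 4, 6: 4})
print(edu_clean.tolist())  # [1, 2, 3, 4, 4, 4, 4, 2]
```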
df = pd.read_csv("UCI_Credit_Card.csv")
df.head()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
5 rows × 25 columns
2. Data splitting ¶
rubric={points:2}
Your tasks:
- Split the data into train (70%) and test (30%) portions with
random_state=123.
If your computer cannot handle training on 70% training data, make the test split bigger.
Solution_2
Points: 2
train_df, test_df = train_test_split(df, test_size=0.3, random_state=123)
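As a side note, since only about 22% of clients default, one could also pass `stratify` so that both portions preserve the class proportions. A sketch on a toy frame (not the split used above, where the instructions only require `random_state=123`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an 80/20 class split, mimicking the imbalance in the real target
df_toy = pd.DataFrame({"x": range(100), "y": [0] * 80 + [1] * 20})

# stratify=df_toy["y"] keeps ~20% positives in both portions
tr, te = train_test_split(df_toy, test_size=0.3, random_state=123, stratify=df_toy["y"])
print(tr["y"].mean(), te["y"].mean())  # both approximately 0.2
```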
3. EDA ¶
rubric={points:10}
Your tasks:
- Perform exploratory data analysis on the train set.
- Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
- Summarize your initial observations about the data.
- Pick appropriate metric/metrics for assessment.
Solution_3
Points: 10
train_df.dtypes
ID int64 LIMIT_BAL float64 SEX int64 EDUCATION int64 MARRIAGE int64 AGE int64 PAY_0 int64 PAY_2 int64 PAY_3 int64 PAY_4 int64 PAY_5 int64 PAY_6 int64 BILL_AMT1 float64 BILL_AMT2 float64 BILL_AMT3 float64 BILL_AMT4 float64 BILL_AMT5 float64 BILL_AMT6 float64 PAY_AMT1 float64 PAY_AMT2 float64 PAY_AMT3 float64 PAY_AMT4 float64 PAY_AMT5 float64 PAY_AMT6 float64 default.payment.next.month int64 dtype: object
train_df["default.payment.next.month"].value_counts(normalize=True)
default.payment.next.month 0 0.776762 1 0.223238 Name: proportion, dtype: float64
There is a class imbalance in the target data.
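This imbalance is why plain accuracy can be misleading here: a classifier that always predicts the majority class already scores about 0.78. A minimal sketch with `DummyClassifier` on a synthetic target with roughly the same imbalance (the array `y` below is made up, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic 0/1 target with roughly the same imbalance as the train split (~22% defaults)
rng = np.random.default_rng(123)
y = (rng.random(21000) < 0.2232).astype(int)
X = np.zeros((len(y), 1))  # features are irrelevant to a most-frequent dummy

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(f"Dummy accuracy: {dummy.score(X, y):.3f}")  # equals the majority-class proportion
```

This motivates reporting precision, recall, f1, and average precision alongside accuracy for this problem.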
train_df.describe()
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 | ... | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 | 2.100000e+04 | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 | 21000.000000 |
| mean | 14962.348238 | 167880.651429 | 1.600762 | 1.852143 | 1.554000 | 35.500810 | -0.015429 | -0.137095 | -0.171619 | -0.225238 | ... | 43486.610905 | 40428.518333 | 38767.202667 | 5673.585143 | 5.895027e+03 | 5311.432286 | 4774.021381 | 4751.850095 | 5237.762190 | 0.223238 |
| std | 8650.734050 | 130202.682167 | 0.489753 | 0.792961 | 0.521675 | 9.212644 | 1.120465 | 1.194506 | 1.196123 | 1.168556 | ... | 64843.303993 | 61187.200817 | 59587.689549 | 17033.241454 | 2.180143e+04 | 18377.997079 | 15434.136142 | 15228.193125 | 18116.846563 | 0.416427 |
| min | 1.000000 | 10000.000000 | 1.000000 | 0.000000 | 0.000000 | 21.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | ... | -50616.000000 | -61372.000000 | -339603.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 7498.750000 | 50000.000000 | 1.000000 | 1.000000 | 1.000000 | 28.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | ... | 2293.750000 | 1739.500000 | 1215.750000 | 1000.000000 | 8.200000e+02 | 390.000000 | 266.000000 | 234.000000 | 110.750000 | 0.000000 |
| 50% | 14960.500000 | 140000.000000 | 2.000000 | 2.000000 | 2.000000 | 34.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 19102.500000 | 18083.000000 | 16854.500000 | 2100.000000 | 2.007000e+03 | 1809.500000 | 1500.000000 | 1500.000000 | 1500.000000 | 0.000000 |
| 75% | 22458.250000 | 240000.000000 | 2.000000 | 2.000000 | 2.000000 | 41.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 54763.250000 | 50491.000000 | 49253.750000 | 5007.250000 | 5.000000e+03 | 4628.500000 | 4021.250000 | 4016.000000 | 4000.000000 | 0.000000 |
| max | 30000.000000 | 1000000.000000 | 2.000000 | 6.000000 | 3.000000 | 79.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | ... | 891586.000000 | 927171.000000 | 961664.000000 | 873552.000000 | 1.227082e+06 | 896040.000000 | 621000.000000 | 426529.000000 | 528666.000000 | 1.000000 |
8 rows × 25 columns
Summary Statistics
- Sample dataset contains 21,000 credit card clients from Taiwan
- Average credit limit is 167,880.65 TWD, average default rate in the next month is 22.32% (mean default.payment.next.month = 0.2232)
- Demographics: the mean SEX code is 1.60, so the majority of clients are female (2); the mean EDUCATION code is 1.85, leaning towards university (2); the mean MARRIAGE code is 1.55, leaning towards single (2); and the average age is about 35.5 years
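For readability during EDA, the numeric codes can be mapped to labels using the Kaggle data dictionary. A sketch on a tiny made-up frame (`demo` is not the real data, and the undocumented code 0 would map to NaN here):

```python
import pandas as pd

# Code-to-label mappings from the Kaggle data dictionary
sex_map = {1: "male", 2: "female"}
marriage_map = {1: "married", 2: "single", 3: "others"}

demo = pd.DataFrame({"SEX": [2, 1, 2], "MARRIAGE": [1, 2, 3]})
demo_labeled = demo.assign(
    SEX=demo["SEX"].map(sex_map),
    MARRIAGE=demo["MARRIAGE"].map(marriage_map),
)
print(demo_labeled)
```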
# wrangling data so PAY, BILL_AMT, and PAY_AMT columns across the 6 months are combined
# change inconsistent naming
train_df.rename(columns={"PAY_0": "PAY_1"}, inplace=True)
train_df.rename(columns={"default.payment.next.month": "default_payment_next_month"}, inplace=True)
# Convert PAY_X columns to long format
pay_df = train_df.melt(id_vars=['ID'], value_vars=['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'],
var_name="Month", value_name="Repayment Status")
pay_df['Month'] = pay_df['Month'].str.extract(r'(\d+)')  # Extract month number (raw string avoids an invalid-escape warning)
# Convert BILL_AMTX columns to long format
bill_df = train_df.melt(id_vars=['ID'], value_vars=['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'],
var_name="Month", value_name="Bill Amount")
bill_df['Month'] = bill_df['Month'].str.extract(r'(\d+)')  # Extract month number
# Convert PAY_AMTX columns to long format
pay_amt_df = train_df.melt(id_vars=['ID'], value_vars=['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'],
var_name="Month", value_name="Payment Amount")
pay_amt_df['Month'] = pay_amt_df['Month'].str.extract(r'(\d+)')  # Extract month number
# Merge all three dfs on ID and Month
df_long = pay_df.merge(bill_df, on=['ID', 'Month']).merge(pay_amt_df, on=['ID', 'Month'])
# Convert Month to integer for sorting
df_long['Month'] = df_long['Month'].astype(int)
# merge df_long with remaining columns
pay_columns = ['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
bill_columns = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
pay_amt_columns = ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
id_vars = [col for col in train_df.columns if col not in pay_columns + bill_columns + pay_amt_columns]
df_long = df_long.merge(train_df[id_vars], on=['ID'])
month_mapping = {
1: "Sep", 2: "Aug", 3: "Jul", 4: "Jun", 5: "May", 6: "Apr"
}
# Apply the mapping to the "Month" column
df_long["Month"] = df_long["Month"].replace(month_mapping)
# display result
df_long.head()
# confirm there are 6 rows (1 per month) per ID
df_long.query("ID == 1")
| ID | Month | Repayment Status | Bill Amount | Payment Amount | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | default_payment_next_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 12426 | 1 | Sep | 2 | 3913.0 | 0.0 | 20000.0 | 2 | 2 | 1 | 24 | 1 |
| 33426 | 1 | Aug | 2 | 3102.0 | 689.0 | 20000.0 | 2 | 2 | 1 | 24 | 1 |
| 54426 | 1 | Jul | -1 | 689.0 | 0.0 | 20000.0 | 2 | 2 | 1 | 24 | 1 |
| 75426 | 1 | Jun | -1 | 0.0 | 0.0 | 20000.0 | 2 | 2 | 1 | 24 | 1 |
| 96426 | 1 | May | -2 | 0.0 | 0.0 | 20000.0 | 2 | 2 | 1 | 24 | 1 |
| 117426 | 1 | Apr | -2 | 0.0 | 0.0 | 20000.0 | 2 | 2 | 1 | 24 | 1 |
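For reference, the same three-family reshape can be written more compactly with `pd.wide_to_long`, which melts every stub in one call. A sketch on a tiny stand-in frame (it assumes `PAY_0` has already been renamed to `PAY_1`, as above):

```python
import pandas as pd

# Tiny stand-in with the same column-naming pattern as train_df
wide = pd.DataFrame({
    "ID": [1, 2],
    "PAY_1": [2, -1], "PAY_2": [2, 2],
    "BILL_AMT1": [3913.0, 2682.0], "BILL_AMT2": [3102.0, 1725.0],
    "PAY_AMT1": [0.0, 0.0], "PAY_AMT2": [689.0, 1000.0],
})

# One call melts all three column families; the numeric suffix becomes "Month"
long = (
    pd.wide_to_long(wide, stubnames=["PAY_", "BILL_AMT", "PAY_AMT"], i="ID", j="Month")
    .reset_index()
    .rename(columns={"PAY_": "Repayment Status",
                     "BILL_AMT": "Bill Amount",
                     "PAY_AMT": "Payment Amount"})
)
print(long)
```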
alt.data_transformers.disable_max_rows()
df_demo = df_long.drop(columns=["ID", "Month"])
df_demo_X = df_demo.drop(columns=["default_payment_next_month"])
demo = alt.Chart(df_demo).mark_bar().encode(
alt.X(alt.repeat('row'), type='nominal'),
alt.Y(alt.repeat('column'), aggregate='average', type='quantitative'),
alt.Tooltip(alt.repeat('column'), aggregate='average', type='quantitative')
).properties(
width=150,
height=150
).repeat(
row=["default_payment_next_month"],
column=list(df_demo_X.columns)  # repeat expects a list, not a pandas Index
)
demo.properties(title = "Comparison Between Credit Card Clients that Did vs. Did Not Default in October")